Skip to content

Fix Usage event when a volume in allocated state is deleted#5901

Merged
DaanHoogland merged 5 commits into
apache:mainfrom
scclouds:fix-delete-event-of-volumes-in-allocated-state
Feb 21, 2022
Merged

Fix Usage event when a volume in allocated state is deleted#5901
DaanHoogland merged 5 commits into
apache:mainfrom
scclouds:fix-delete-event-of-volumes-in-allocated-state

Conversation

@SadiJr
Copy link
Copy Markdown
Contributor

@SadiJr SadiJr commented Jan 26, 2022

Description

When deleting a volume in allocated state, via API deleteVolume, no Usage event is generated, which causes an inconsistency in the Usage Server. Therefore, the volume appears as never deleted/destroyed.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

It's tested in a local lab:

  1. I created new volumes but I didn't attach them to any VM.
  2. I deleted this volumes via API deleteVolume.
  3. Before the changes, no usage event was generated and, in the cloud_usage DB, the new volumes appears only with the VOLUME.CREATE event. No VOLUME.DELETE event is handled, which causes the volume to be never "deleted/destroyed" in the usage system.
  4. With this PR, when deleting volumes in allocated state, the VOLUME.DELETE usage event is correctly handled.

@SadiJr SadiJr changed the title Fix Usage event when a volume in allocated state is deleted via API d… Fix Usage event when a volume in allocated state is deleted Jan 26, 2022
@DaanHoogland
Copy link
Copy Markdown
Contributor

@SadiJr your solution is great if it works. I does look a bit counter intuitive however; forbidding delete in allocated state leads to an event when deleting in allocated state. Can you add a short note on why this would work, please?

@SadiJr
Copy link
Copy Markdown
Contributor Author

SadiJr commented Jan 28, 2022

@DaanHoogland, when a volume is created, ACS generates a record in the table cloud.usage_event with the type VOLUME.CREATE. This record is synced to the table cloud_usage.usage_event and then, from it, ACS generates a record in the table cloud_usage.usage_volume. If the record in cloud_usage.usage_volume has field deleted as null, ACS will generate records in cloud_usage.cloud_usage, which means that the resource will be billed. When a volume is deleted, ACS generates a record in the table cloud.usage_event with the type VOLUME.DELETE, that is synced to the table cloud_usage.usage_event and then, from it, ACS mark the record in the table cloud_usage.usage_volume as deleted, which will stop the resource's billing.

This is the normal flow, however, when a volume in state allocated is deleted, it does not generate the record with type VOLUME.DELETE. This not happens because, before of executing the code that leads to the right behavior, ACS validates if the state of the volume is not contained by the Set STATES_VOLUME_CANNOT_BE_DESTROYED.

@DaanHoogland
Copy link
Copy Markdown
Contributor

ok, thanks @SadiJr , I got that. But is removing the ALLOCATED state from the list safe?
I.E. does the delete volume flow it goes through than result in nasty side effects? I suppose it was put in the list on purpose.

That all said, your code looks good and testing will have to prove it ;)

@SadiJr
Copy link
Copy Markdown
Contributor Author

SadiJr commented Feb 8, 2022

@DaanHoogland a more complex explanation:

  1. To work with volumes, ACS has a listener, called VolumeStateListener. This listener has a method called postStateTransitionEvent, which can generate a Usage Event in some transactional events. The transactional events that can generate Usage Events can be found in com.cloud.storage.Volume class.
  2. When creating a new volume (not attached to any VM), the API createVolume is used. This API call method allocVolume of class VolumeApiServiceImpl (which only alloc the volume in DB, without sending the createvolume cmd to hypervisor, this command is only sended when the volume is attached to a VM), and, this method calls the method commitVolume, which generates a VOLUME.CREATE Usage Event.
  3. So, when destroying a volume in allocated state, via API destroyVolume, the method destroyVolumeIfPossible of class VolumeApiServiceImpl will be called and, if the volume state isn't in the Set of volume states cannot be destroyed (which isn't the case of allocated state), the method destroyVolume of class VolumeServiceImpl is called.
  4. In this method, the volume state has changed to DestroyRequested (and then to Destroy) and then to OperationSucceeded.
  5. When volume changes is state to Destroy with OperationSucceeded event, the listener, in the postStateTransitionEvent method, generates a VOLUME.DELETE Usage event.
  6. But, in the normal execution, volumes in allocated state has only it state changed to DestroyRequested, but no state change event of type OperationSucceeded is used, which causes ACS, more specific the listener, to not generate a DELETE Usage Event.
  7. To fix this, there a lot of possibilities:
    • Don't generate a CREATE Usage Event if volume is not allocated to any VM;
    • Remove the allocated state from STATES_VOLUME_CANNOT_BE_DESTROYED Set;
    • In the destroyVolume method, when volume is in allocated state, make a transaction to DestroyRequested and, with success, make another transaction to OperationSucceeded, to make listener generates a DELETE Usage Event.
  8. In this PR, I first used the second option, but, after a new review, I chose to use the third

PS: I'm not sure I was clear in my answer. If you have any questions or suggestions, just let me know :)

Copy link
Copy Markdown
Member

@GabrielBrascher GabrielBrascher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SadiJr thanks for the PR.

I think that I understood the whole situation and, hopefully, I can also extend your (@SadiJr) explanation to @DaanHoogland. Please let me know if I am wrong.

So, the main issue is that CloudStack usage requires that the Volume transits to operation succeeded after the DestroyRequest (which makes sense, so it ensures there would be no garbage in a storage pool in case something goes wrong with the deletion process).

Easier to see this needed action when checking the Volume state machine:
s_fsm.addTransition(new StateMachine2.Transition<State, Event>(Destroy, Event.OperationSucceeded, Destroy, Arrays.asList(new StateMachine2.Transition.Impact[]{StateMachine2.Transition.Impact.USAGE})));

So indeed, the state machine transition from Destroy via the event OperationSucceeded impacts also in the usage new StateMachine2.Transition.Impact[]{StateMachine2.Transition.Impact.USAGE}))

Any volume in the "Allocated" State (The volume is allocated but has not been created yet) is simply an entry in the DB, so no reason to "not have" the Volume.Event.OperationSucceeded.
With that said, your code is LGTM.

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cltgm

@nvazquez
Copy link
Copy Markdown
Contributor

@blueorangutan package

@nvazquez nvazquez added this to the 4.17.0.0 milestone Feb 19, 2022
@blueorangutan
Copy link
Copy Markdown

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2658

@nvazquez
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3392)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30683 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5901-t3392-kvm-centos7.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Copy link
Copy Markdown
Contributor

tests ok:
image

@DaanHoogland DaanHoogland merged commit e7082d9 into apache:main Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants